
Support NeMo Llama 70B LoRA training #1084

Open
youth123 wants to merge 1 commit into flagos-ai:main-legacy from youth123:support_mlperf_nemo_v2

youth123 commented Jan 26, 2026

PR Category
Train

PR Types
New Features

PR Description

  • Supports loading and saving checkpoints in the NeMo zarr format
  • Supports training with packed sequences
  • Fixes wandb finalization failing to find the latest_checkpointed_iteration file
  • Fixes LoRA not supporting LayerNorm weight loading or the NeMo zarr format
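The packed-sequence bullet can be illustrated with a small sketch. This is not the PR's implementation: it assumes a greedy first-fit-decreasing packer, and `pack_sequences` is a hypothetical helper. Each pack records the indices of the sequences it holds plus their cumulative token boundaries (`cu_seqlens`), the form that variable-length attention kernels such as FlashAttention's varlen interface accept so that attention stays within each original sequence.

```python
from typing import List, Tuple

# Hypothetical helper (not from the PR): greedy first-fit-decreasing packing.
def pack_sequences(seq_lens: List[int], max_len: int) -> List[Tuple[List[int], List[int]]]:
    """Pack sequence indices into packs of at most max_len tokens.

    Returns one (indices, cu_seqlens) pair per pack, where cu_seqlens holds the
    cumulative token offsets of each sequence inside the pack, starting at 0.
    """
    # Place the longest sequences first; this usually leaves less unused space.
    order = sorted(range(len(seq_lens)), key=lambda i: seq_lens[i], reverse=True)
    packs: List[List[int]] = []
    free: List[int] = []  # remaining token capacity of each pack
    for i in order:
        n = seq_lens[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has {n} tokens, more than max_len={max_len}")
        for p, room in enumerate(free):
            if n <= room:  # first existing pack with enough room
                packs[p].append(i)
                free[p] -= n
                break
        else:  # no existing pack fits: open a new one
            packs.append([i])
            free.append(max_len - n)
    result = []
    for idxs in packs:
        cu_seqlens = [0]
        for i in idxs:
            cu_seqlens.append(cu_seqlens[-1] + seq_lens[i])
        result.append((idxs, cu_seqlens))
    return result


packs = pack_sequences([5, 3, 8, 2], max_len=10)
print(packs)  # [([2, 3], [0, 8, 10]), ([0, 1], [0, 5, 8])]
```

Sorting longest-first is a common heuristic for bin packing; a real data pipeline would also carry the token tensors and position IDs alongside these boundaries.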

The checkpoint file layouts are as follows:

Load (NeMo zarr format):
- context
- weights
  - module.decoder.xxx._extra_state
  - module.decoder.xxx.weight
  - optimizer.state.fp32_param.xxx.weight
  - optimizer.state.fp32_param.xxx.weight.sync
  - common.pt
  - metadata.json
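A loader for this layout might first check that the expected entries exist before handing `weights` to the distributed-checkpoint reader. The sketch below assumes that nesting (tensor shards plus `common.pt` and `metadata.json` under `weights`); `missing_nemo_entries` and `weights_dir` are hypothetical helpers, not part of NeMo or Megatron-LM.

```python
from pathlib import Path
from typing import List

# Entries the load layout is assumed to contain, relative to the checkpoint root.
REQUIRED_ENTRIES = ["context", "weights", "weights/common.pt", "weights/metadata.json"]

def missing_nemo_entries(ckpt_dir) -> List[str]:
    """Return the required entries absent under ckpt_dir (empty list if complete)."""
    root = Path(ckpt_dir)
    return [rel for rel in REQUIRED_ENTRIES if not (root / rel).exists()]

def weights_dir(ckpt_dir) -> Path:
    """Return the directory holding the sharded tensors, or raise if the layout is incomplete."""
    missing = missing_nemo_entries(ckpt_dir)
    if missing:
        raise FileNotFoundError(f"{ckpt_dir} is missing: {', '.join(missing)}")
    return Path(ckpt_dir) / "weights"
```

Failing early with the list of missing entries gives a clearer error than a failure deep inside the tensor reader.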

Save (zarr format):
- iter_xxx
  - module.decoder.xxx._extra_state
  - module.decoder.xxx.weight
  - optimizer.state.fp32_param.xxx.weight
  - optimizer.state.fp32_param.xxx.weight.sync
  - common.pt
  - metadata.json
- latest_checkpointed_iteration.txt
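`latest_checkpointed_iteration.txt` is also the file named in the wandb finalization fix. A minimal sketch of writing and reading it, assuming Megatron-LM's conventions that the tracker holds the latest iteration as plain text and that iteration directories use a seven-digit zero-padded number (for example `iter_0000100`); `iteration_dir`, `write_tracker`, and `read_latest_iteration` are hypothetical helpers. Returning `None` when the tracker is absent (as in the load layout, which has no tracker file) lets a caller such as a finalization hook skip cleanly instead of failing.

```python
import os
from pathlib import Path
from typing import Optional

TRACKER_FILENAME = "latest_checkpointed_iteration.txt"

def iteration_dir(save_dir, iteration: int) -> Path:
    # Assumed Megatron-LM naming: "iter_" followed by a 7-digit zero-padded iteration.
    return Path(save_dir) / f"iter_{iteration:07d}"

def write_tracker(save_dir, iteration: int) -> Path:
    """Record `iteration` as the latest checkpoint, replacing the tracker atomically."""
    save_dir = Path(save_dir)
    save_dir.mkdir(parents=True, exist_ok=True)
    tracker = save_dir / TRACKER_FILENAME
    tmp = tracker.with_name(tracker.name + ".tmp")
    tmp.write_text(str(iteration))
    os.replace(tmp, tracker)  # readers see either the old or the new file, never a partial one
    return tracker

def read_latest_iteration(save_dir) -> Optional[int]:
    """Return the latest saved iteration, or None if no tracker file exists."""
    tracker = Path(save_dir) / TRACKER_FILENAME
    if not tracker.is_file():
        return None
    return int(tracker.read_text().strip())
```

Writing to a temporary file and then calling `os.replace` keeps a crash mid-write from leaving a truncated tracker behind.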

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


liji seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

tengqm (Contributor) commented Feb 3, 2026

@youth123 Please double-check that you are submitting the PR with the email address associated with your GitHub account (git config user.email). Also, please make sure you have signed the CLA so that your PR can be reviewed and approved. Thanks.
